DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2017.7

Fast Spaced Seed Hashing

Authors: Samuele Girotto, Matteo Comin, and Cinzia Pizzi

Published in: LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)

Abstract

Hashing k-mers is a common operation across many bioinformatics applications, and it is widely used for indexing, querying, and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, have come into routine use in place of k-mers. Spaced seeds improve sensitivity over k-mers in many applications; however, hashing spaced seeds substantially increases computation time. Hence, the ability to speed up spaced seed hashing would have a major impact in the field, making spaced seed applications not only accurate but also faster and more efficient. In this paper we address the problem of efficient spaced seed hashing. The proposed algorithm exploits the similarity of the hash values of adjacent spaced seeds in an input sequence in order to compute the next hash efficiently. We report a series of experiments on hashing NGS reads with several spaced seeds. In the experiments, our algorithm computes the hash values of spaced seeds with a speedup over the traditional approach of between 1.6x and 5.3x, depending on the structure of the spaced seed.
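As an illustration of what hashing a spaced seed involves, the following is a minimal Python sketch of the baseline approach, in which every window is hashed from scratch. It is not the algorithm proposed in the paper. The 2-bit nucleotide encoding, the function name `spaced_seed_hashes`, and the seed `1101` are assumptions chosen for the example.

```python
# Assumed 2-bit nucleotide encoding (illustrative, not taken from the paper).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(sequence, seed):
    """Hash every window of `sequence` under the spaced seed `seed`.

    `seed` is a string of '1' (position is used) and '0' (position is
    ignored). Each window is hashed independently, which is the baseline
    cost that a faster scheme would try to reduce.
    """
    # Offsets inside the window that contribute to the hash.
    care = [i for i, c in enumerate(seed) if c == "1"]
    hashes = []
    for start in range(len(sequence) - len(seed) + 1):
        h = 0
        for offset in care:
            # Append the 2-bit code of each selected nucleotide.
            h = (h << 2) | ENCODE[sequence[start + offset]]
        hashes.append(h)
    return hashes


print(spaced_seed_hashes("ACGTAC", "1101"))  # [7, 24, 45]
```

Because the third position of the hypothetical seed `1101` is ignored, the windows `ACGT` and `ACTT` receive the same hash, which is how a spaced seed tolerates a mismatch at that position.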

Cite as

Samuele Girotto, Matteo Comin, and Cinzia Pizzi. Fast Spaced Seed Hashing. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{girotto_et_al:LIPIcs.WABI.2017.7,
  author =	{Girotto, Samuele and Comin, Matteo and Pizzi, Cinzia},
  title =	{{Fast Spaced Seed Hashing}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{7:1--7:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.7},
  URN =		{urn:nbn:de:0030-drops-76501},
  doi =		{10.4230/LIPIcs.WABI.2017.7},
  annote =	{Keywords: k-mers, spaced seeds, efficient hashing}
}

Document

DOI: 10.4230/DagSemProc.10231.7

Remote Homology Detection of Protein Sequences

Authors: Matteo Comin and Davide Verzotto

Published in: Dagstuhl Seminar Proceedings, Volume 10231, Structure Discovery in Biology: Motifs, Networks & Phylogenies (2010)

Abstract

The classification of protein sequences using string kernels provides valuable insights for protein function prediction. Almost all string kernels are based on patterns that are not independent, and therefore the associated scores are obtained using a set of redundant features. In this talk we will discuss how a class of patterns, called Irredundant, is specifically designed to address this issue. Loosely speaking, the set of Irredundant patterns is the smallest class of independent patterns that can describe all patterns in a string. We present a classification method based on the statistics of these patterns, named Irredundant Class. Results on benchmark data show that Irredundant Class outperforms most of the string kernel methods previously proposed, and it achieves results as good as the current state-of-the-art methods with fewer patterns. Unfortunately, we show that the information carried by the irredundant patterns cannot be easily interpreted, and thus alternative notions are needed.

Cite as

Matteo Comin and Davide Verzotto. Remote Homology Detection of Protein Sequences. In Structure Discovery in Biology: Motifs, Networks & Phylogenies. Dagstuhl Seminar Proceedings, Volume 10231, pp. 1-20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2010)

@InProceedings{comin_et_al:DagSemProc.10231.7,
  author =	{Comin, Matteo and Verzotto, Davide},
  title =	{{Remote Homology Detection of Protein Sequences}},
  booktitle =	{Structure Discovery in Biology: Motifs, Networks \& Phylogenies},
  pages =	{1--20},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2010},
  volume =	{10231},
  editor =	{Apostolico, Alberto and Dress, Andreas and Parida, Laxmi},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.10231.7},
  URN =		{urn:nbn:de:0030-drops-27419},
  doi =		{10.4230/DagSemProc.10231.7},
  annote =	{Keywords: Classification of protein sequences, irredundant patterns}
}
